Abstract: In this work, we propose an approach, known as the C2SS-BF method, to synchronizing similar sets of data that uses an Invertible Bloom Filter (IBF). The C2SS-BF method builds on previous work by Eppstein et al. in [6]. By allowing two rounds of communication, we show that in many cases the proposed approach requires substantially less throughput than the algorithm proposed in [6]. The C2SS-BF method also compares favorably to the work by Guo and Li in [9]; in particular, it has lower computational complexity and requires less throughput.

I.  Introduction 

There has been an increasing need to maintain a Common Operational Picture (COP) between a collection of hosts within a disconnected, intermittent and low-bandwidth (DIL) maritime environment. Existing Command and Control (C2) systems that use event-based protocols may conserve bandwidth, but they do not guarantee a COP in a DIL environment. The purpose of the Command and Control Data Synchronization Service (C2SS) is to develop synchronization tools that not only ensure synchronization occurs, but also guarantee synchronization is achievable within a DIL environment. In particular, the unreliable nature of the DIL environment is accounted for in our framework by ensuring that our method addresses the following core properties:

1)  The  method  uses  minimal  communication  rounds,  and 

2)  The  throughput  between  any  two  hosts  on  the  network 
is  limited. 

The document is organized as follows. In Section II, we review the current approach to reconciling data sets used by C2SS and detail our contribution. In Section III, we define the notation to be used in this paper. In Section IV, we describe our approach to synchronizing similar sets of data. In Section V, we provide simulation results comparing our proposed algorithm to existing approaches. Section VI concludes the paper.

II. Current Approach to Reconciling Data in C2SS and Our Contribution

Suppose there are two hosts A and B where Host A has access to the set $S_A \subseteq GF(2)^b$ and Host B has access to the set $S_B \subseteq GF(2)^b$. The set reconciliation problem is to determine which information must be sent between Host A and Host B so that each host can compute $S_A \cup S_B$.

We use the terms set reconciliation, data reconciliation, and synchronization to refer to the process by which Host A and Host B compute $S_A \cup S_B$.

In [6], [10], and [13] the authors considered the set reconciliation problem under the additional constraint that only a single round of communication was allowed. The goal in this work is to provide a solution to the set reconciliation problem that requires two rounds of communication. It is also desirable that any proposed algorithm possess low encoding and decoding complexity.

The current approach taken by the C2SS software (for set reconciliation) is to use a set of hashes along with a Merkle tree. The hashes are used to represent some unit of information and the Merkle tree organizes the hashes in a hierarchical manner to facilitate comparison. For shorthand, we refer to this approach as the C2SS-HM method. The C2SS-HM method has been demonstrated to provide very reliable data synchronization; however, it needs to be optimized in the following areas:

1) For wide Merkle trees, many hashes need to be compared/exchanged at the same time, and

2) for tall Merkle trees, set reconciliation requires many rounds of communication.

In this work, we draw from the analysis provided in [12], which also uses a method similar to the C2SS-HM method. In the following analysis we assume the Merkle tree is balanced. Assuming $w$ is the width of the tree and the size of the symmetric difference is $d = |S_A \triangle S_B| = |(S_A \setminus S_B) \cup (S_B \setminus S_A)|$, it was shown in [12] that the expected number of rounds of communication (or the height of the tree) is
0(2  log^(  j^))  +  0(1)  for  some  positive  integer  d.  We  note 
that  given  the  unreliable  nature  of  a  DIL  environment,  a 
protocol  with  fewer  rounds  of  communication  is  desirable.  In 
addition,  the  C2SS-HM  method  also  requires  maintaining  a 
tree  structure  (preferably  balanced)  in  memory. 

The purpose of this document is to discuss a new approach to set reconciliation that overcomes several of the drawbacks of the C2SS-HM method. More specifically, we propose an approach to set reconciliation that requires at most two rounds of communication and does not require a tree structure. The principal tool used in our proposed method is a Bloom Filter, and so we refer to our method as the C2SS-BF method. The basic idea behind the C2SS-BF method is similar to that proposed in [6] and [9]. We compute a hash on Host A and a hash on Host B. Then, Hosts A and B exchange their hashes. On Host A we determine $S_A \setminus S_B$ and similarly on Host B we determine $S_B \setminus S_A$. Finally, the information $S_A \setminus S_B$ and $S_B \setminus S_A$ is exchanged between Host A and Host B.

The C2SS-BF method has the following important attributes:

1)  Requires  less  throughput  than  current  approaches  to  set 
reconciliation  ([6],  [13]). 

2)  Requires  only  two  rounds  of  communication. 

In  Section  V  and  Appendix  B,  we  provide  comparisons 
between  existing  approaches  in  the  literature  and  the  proposed 
algorithm  in  this  paper. 

III. Some Notation

The following is a description of the notation used in the remainder of this paper. We assume that $f_1, \ldots, f_k$ are hash functions where, for any $i \in \{1, 2, \ldots, k\}$,

$$f_i : GF(2)^b \to \left\{ (i-1)\cdot\frac{d(k+1)}{k} + 1, \ldots, i\cdot\frac{d(k+1)}{k} \right\}$$

for positive integers $d$, $k$. For an element $x \in GF(2)^b$, let $f^k(x) = (f_1(x), f_2(x), \ldots, f_k(x))$. We refer to the vector $f^k(x)$ as the locations of the element $x$.
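To make the notation concrete, the following is a minimal Python sketch of a location function. It assumes that each $f_i$ maps into its own block of $d(k+1)/k$ consecutive cells (the values in the worked example of Section IV-A fit this layout) and that $d(k+1)$ is divisible by $k$. The use of SHA-256 with a one-byte index prefix, and the name `locations`, are illustrative assumptions and are not taken from the paper.

```python
import hashlib


def locations(x: bytes, d: int, k: int) -> tuple:
    """Return (f_1(x), ..., f_k(x)) as 1-based cell indices.

    Assumption: f_i maps into the i-th block of d*(k+1)/k cells.
    """
    n = d * (k + 1)
    if n % k != 0:
        raise ValueError("this sketch needs d*(k+1) to be divisible by k")
    block = n // k
    cells = []
    for i in range(1, k + 1):
        # Salt the digest with the hash index so the k functions differ.
        digest = hashlib.sha256(bytes([i]) + x).digest()
        offset = int.from_bytes(digest[:8], "big") % block
        cells.append((i - 1) * block + offset + 1)
    return tuple(cells)
```

With $d = 4$ and $k = 2$ this yields one location in cells 1 to 6 and one in cells 7 to 12, and the same element always maps to the same locations on both hosts.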

IV.  The  C2SS-BF  Method 

In this section, we begin by describing at a high level the basic ideas behind the C2SS-BF method. Afterwards, we describe in detail the algorithm for computing $S_A \triangle S_B$.

Recall that our problem is the following: given two hosts A and B, where Host A has access to the set $S_A \subseteq GF(2)^b$ and Host B has access to the set $S_B \subseteq GF(2)^b$, determine which information must be sent between Host A and Host B using two rounds of communication so that Host B and Host A can each compute $S_A \cup S_B$. Let $t = |S_B| \ge |S_A|$ and let $d = |S_A \triangle S_B|$. We assume that $t$ and $d$ are known to both Host A and Host B. We note that estimates of $d$ can be obtained using sampling techniques such as those described in [6] and [7]. After two rounds of communication, the result is that Host A and Host B will recover $S_A \triangle S_B$ (and consequently can determine $S_A \cup S_B$), failing with probability $O(d^{-k+2})$.

In Section IV-A, we describe the encoding process followed by the decoding process in Section IV-B. Afterwards, we comment on the encoding/decoding complexity and the throughput required for the C2SS-BF method.

The  algorithm  proceeds  as  follows: 

1) Host A inserts all the elements in $S_A$ into a Bloom Filter (BF), denoted $h_{S_A}$. Host B inserts the elements from $S_B$ into a Bloom Filter, denoted $h_{S_B}$.

2) Host A transmits $h_{S_A}$ to Host B and Host B transmits $h_{S_B}$ to Host A.

3) Host B receives $h_{S_A}$ and uses it to compute $S_B \setminus S_A$, and Host A receives $h_{S_B}$ and uses it to compute $S_A \setminus S_B$.

4) Host B transmits $S_B \setminus S_A$ to Host A, and Host A transmits $S_A \setminus S_B$ to Host B.

Since $S_A \cup S_B = S_A \cup (S_B \setminus S_A) = S_B \cup (S_A \setminus S_B)$, after the completion of step 4), both Host A and Host B can determine $S_A \cup S_B$.

The computations of $h_{S_A}$ and $h_{S_B}$ are identical and are described in Section IV-A.
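The set identity behind step 4) can be checked with a short Python sketch. It is only an illustration of the message flow: the first round is replaced by exact set differences, standing in for what each host would recover from the other host's filter, and the function name `reconcile` is hypothetical.

```python
def reconcile(s_a: set, s_b: set):
    """Return the unions held by Host A and Host B after round two."""
    # Round 1 stand-in: each host learns which of its own elements
    # the other host is missing (exact here, probabilistic with a filter).
    a_only = s_a - s_b
    b_only = s_b - s_a
    # Round 2: only the one-sided differences are transmitted.
    union_at_a = s_a | b_only
    union_at_b = s_b | a_only
    return union_at_a, union_at_b
```

Only the $d$ differing elements cross the network in the second round, which is where the throughput saving over sending whole sets comes from.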

A.  Encoding 

On each host, we begin by creating a special type of Bloom Filter known as an Invertible Bloom Filter (IBF). Our IBF is comprised of a collection of $n = d(k+1)$ cells where $d = |S_A \triangle S_B|$ and $k$ is some integer. We assume, for convenience, that $n$ is a power of two. As before, we refer to the IBF on Host A as $h_{S_A}$ and similarly we refer to the IBF on Host B as $h_{S_B}$. We use the terms hash and IBF interchangeably for the remainder of the paper.

To encode $h_{S_A}$ (and similarly for $h_{S_B}$) we simply insert all the elements in $S_A$ (or $S_B$) into an IBF. In the following, we let $S = S_A$ if the procedure is being performed on Host A and $S = S_B$ if the procedure is being performed on Host B.

When an element is inserted into an IBF it is hashed to $k$ different cells, and we assume the hash functions $f_1, \ldots, f_k$ are the same on both Host A and Host B. Let $q$ be a positive integer to be defined later. Each cell contains two fields:

1) $c$: an integer which is simply the number of times the cell has been hashed to, modulo $q$.

2) $l$: the sum of all the element locations that have hashed into the cell, modulo $n$.

To fix ideas, we include the encoding algorithm for the C2SS-BF method below along with an example. We refer to the $i$-th cell in the IBF below as $h_S[i]$. We assume that $h_S$ is initialized so that the count field for every cell is zero and the locations field is simply all-zeros.


Algorithm 1: C2SS-BF Encode

input : The set $S \subseteq GF(2)^b$
output: The IBF $h_S$

for every $x \in S$ do
    for $i = 1 : k$ do
        $h_S[f_i(x)].c = h_S[f_i(x)].c + 1 \bmod q$;
        $h_S[f_i(x)].l = h_S[f_i(x)].l + f^k(x) \bmod n$;
    end
end


Example 1. Suppose $d = 4$, $k = 2$ and $S = \{(0,0,0), (1,1,0)\}$ where $f_1((0,0,0)) = 5$, $f_2((0,0,0)) = 8$, $f_1((1,1,0)) = 5$, and $f_2((1,1,0)) = 9$. Assume that Algorithm 1 is performed on the elements from $S$ resulting in the IBF $h_S$. Then, $h_S$ would appear as shown in Figure 1.
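The encoding step can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: cells are stored as dictionaries, cell numbers are 1-based as in the example, the `locate` callback stands in for $f^k$, and the modulus $q = 7$ used in the demonstration is a hypothetical value since the example does not state $q$.

```python
def encode(elements, locate, n, k, q):
    """Insert every element into an IBF with n cells of the form {c, l}."""
    cells = [{"c": 0, "l": [0] * k} for _ in range(n)]
    for x in elements:
        locs = locate(x)  # (f_1(x), ..., f_k(x)), 1-based cell numbers
        for cell in locs:
            entry = cells[cell - 1]
            # c counts insertions modulo q.
            entry["c"] = (entry["c"] + 1) % q
            # l accumulates the location vectors component-wise modulo n.
            entry["l"] = [(a + b) % n for a, b in zip(entry["l"], locs)]
    return cells


# Locations taken from the example: d = 4, k = 2, so n = d * (k + 1) = 12.
EXAMPLE_LOCATIONS = {(0, 0, 0): (5, 8), (1, 1, 0): (5, 9)}
h_s = encode(EXAMPLE_LOCATIONS, EXAMPLE_LOCATIONS.get, n=12, k=2, q=7)
```

Under these assumptions, cell 5 ends with $c = 2$ and $l = (10, 5)$ because both elements hash there, while cells 8 and 9 each hold a single element and therefore store that element's location vector unchanged.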

Fig. 1. IBF $h_S$ when $S = \{(0,0,0), (1,1,0)\}$

We have the following corollary.

Corollary 1. The hash $h_S$ contains

$$d(k+1)\left(\log_2(q) + k\log_2(d(k+1))\right)$$

bits of information.

Proof: According to the encoding procedure, the hash $h_S$ is comprised of $n = d(k+1)$ cells, each with two fields. The first field (the $c$ field) requires $\log_2(q)$ bits of information and the second field (the $l$ field) requires $k\log_2(d(k+1))$ bits of information. ■

In the next subsection, we show how to compute $S_A \triangle S_B$ from $h_{S_A}$, $h_{S_B}$.

B. Decoding

In this subsection, we describe the decoding procedure for the C2SS-BF method performed on Host B or Host A given $h_{S_A}$, $h_{S_B}$. For the purposes of this section, we assume the local host is Host B; however, the algorithm is identical for the case where the decoding takes place on Host A.

We now explain in words the decoding algorithm. Let $h_{S_B}.c = (h_{S_B}[1].c, h_{S_B}[2].c, \ldots, h_{S_B}[n].c)$ and $h_{S_A}.c = (h_{S_A}[1].c, h_{S_A}[2].c, \ldots, h_{S_A}[n].c)$. To determine $S_B \setminus S_A$, we first compute the vector $h_c = h_{S_B}.c - h_{S_A}.c$.

We now consider the following scenario. Suppose $x \in S_A \cap S_B$. If $x$ hashes to cell $i$ in $h_{S_B}$ (that is, if there exists some $j \in [k]$ where $f_j(x) = i$), then since Hosts A and B have the same hash functions, $x$ would also be hashed to cell $i$ in $h_{S_A}$ since $x \in S_A$. Thus, any increments to the vector $h_{S_B}.c$ caused by $x \in S_B$ are canceled out in $h_c$, since the same increments are made to the vector $h_{S_A}.c$.

From the previous paragraph, if cell $i$ in $h_c$ has a value $+1$, it follows that one of the elements from $S_B$ that hashed to cell $i$ in $h_{S_B}$ is from the set $S_B \setminus S_A$. Let $y \in S_B \setminus S_A$ be an element that hashed to cell $i$. In this case, we produce an estimate $\hat{y}$ for $y \in S_B \setminus S_A$ by finding an element $\hat{y} \in S_B$ where $f^k(\hat{y}) = h_{S_B}[i].l$. We will show in Appendix A that with high probability, $\hat{y} = y$. If an element $\hat{y}$ can be found, then we proceed by removing the contribution of $\hat{y}$ from $h_{S_B}$. In other words, we decrement the $c$ fields of all cells that $\hat{y}$ hashed to and we subtract $f^k(\hat{y})$ from the $l$ fields of all cells that $\hat{y}$ hashed to.

Now suppose cell $i$ in $h_c$ has a value $-1$. From the discussion above, it follows that one of the elements, say $y' \in S_A$, that hashed to cell $i$ in $h_{S_A}$ is such that $y' \in S_A \setminus S_B$. Since we assumed the decoding is being performed on Host B, we do not produce an estimate of $y'$. However, with high probability, we have that $f^k(y') = h_{S_A}[i].l$. Since $f^k(y')$ contains the locations $y'$ hashed to, we proceed in this case by removing the contribution of $y'$ from $h_{S_A}$.

The algorithm thus proceeds by successively searching for positions where $h_c$ is equal to $\pm 1$ and removing the elements (as described in the previous two paragraphs) from either $h_{S_A}$ or $h_{S_B}$ until both $h_{S_A}$ and $h_{S_B}$ are empty. The details are provided in Algorithm 2.


Algorithm 2: C2SS-BF Decode

input : $S_B$, $h_{S_A}$, $h_{S_B}$
output: An estimate $\mathcal{T}$ of $S_B \setminus S_A$

1  $i = 1$;
2  while $i \le n$ do
3      $h_c = h_{S_B}.c - h_{S_A}.c$;
4      if $h_c[i] = 1$ then
5          if $\exists!\, y \in S_B : f^k(y) = h_{S_B}[i].l$ then
6              Add $y$ to $\mathcal{T}$;
7              for $j = 1 : k$ do
8                  $h_{S_B}[f_j(y)].c = h_{S_B}[f_j(y)].c - 1 \bmod q$;
9                  $h_{S_B}[f_j(y)].l = h_{S_B}[f_j(y)].l - f^k(y)$
10                     $\bmod n$;
11             end
12         end
13         else
14             STOP. A decoding error occurred.
15         end
16         $i = 0$;
17     end
18     else if $h_c[i] = -1$ then
19         $(j_1, j_2, \ldots, j_k) = h_{S_A}[i].l$;
20         for $m = 1 : k$ do
21             $h_{S_A}[j_m].c = h_{S_A}[j_m].c - 1 \bmod q$;
22             $h_{S_A}[j_m].l = h_{S_A}[j_m].l - (j_1, j_2, \ldots, j_k) \bmod n$;
23         end
24         $i = 0$;
25     end
26     $i = i + 1$;
27 end
28 If $h_c$ does not contain all zeros, then a decoding error has occurred.

We note that for every element $y \in S_A \triangle S_B$, Algorithm 2 requires that, in order to produce the estimate $\hat{y}$, we have to search through the elements in $S_B$. Assuming the elements of $S_B$ are sorted, each search operation would require $O(\log(|S_B|))$ operations, and so the total complexity of Algorithm 2 is $O(d\log(|S_B|))$. Furthermore, as described in the theorem below, the probability of incorrect synchronization is $O(d^{-k+2})$. The proof of the theorem is included in Appendix A.
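The logarithmic search cost can be illustrated with a sorted index and binary search. This is a sketch under the assumption that $S_B$ is sorted by location vector once, ahead of decoding; the names `build_index` and `find_unique` and the `locate` callback are hypothetical.

```python
import bisect


def build_index(s_b, locate):
    """Sort (locations, element) pairs once, ordered by location vector."""
    return sorted(((tuple(locate(y)), y) for y in s_b), key=lambda p: p[0])


def find_unique(index, target):
    """Return the single element whose locations equal target, else None."""
    target = tuple(target)
    # A 1-tuple sorts just before every (target, element) pair.
    pos = bisect.bisect_left(index, (target,))
    matches = []
    while pos < len(index) and index[pos][0] == target:
        matches.append(index[pos][1])
        pos += 1
    return matches[0] if len(matches) == 1 else None
```

Each call costs one binary search plus the length of the run of equal location vectors, so $d$ lookups stay within the $O(d\log(|S_B|))$ figure as long as such runs are short.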

Theorem  1.  If  |5b|  <  k - - and  q  <  k  \og^{d)  + 

log2(l~  +  ) 

$e - 1$, then with probability $O(d^{-k+2})$, the output $\mathcal{T}$ of Algorithm 2 is such that $\mathcal{T} \neq S_B \setminus S_A$.

In the next section, we present simulation results illustrating some properties of the C2SS-BF method.

V.  Simulation  Results 

In this section, we evaluate the C2SS-BF method against the set reconciliation algorithms from [9], [13]. We assumed that there were two hosts A and B where Host A has access to the set $S_A \subseteq GF(2)^b$ and Host B has access to the set $S_B \subseteq GF(2)^b$, with $|S_A| \le 200$ and $|S_B| = 200$. We chose to test the synchronization of bit-strings of size $b = 131072$ (16 kilobytes). The choice of 16 kilobytes was motivated by the setup where two databases are synchronizing their pages (which are usually between 4KB and 32KB [3]). We then tested the performance of the algorithms for varying sizes of $d = |S_A \triangle S_B|$. For every value of $d$ from the set $\{10, 20, 30, 40, 50, 60, 70, 80\}$, we ran 10,000 trials where we attempted to synchronize the sets $S_A$, $S_B$ given that $|S_A \triangle S_B| = d$. In other words, for every trial we attempted to compute $S_A \cup S_B$ on both Host A and Host B.
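A trial input of this shape can be generated with a few lines of Python. The sketch below is an assumption about how such test sets might be built, not a description of the authors' simulator: the differing elements are split as evenly as possible between the two hosts, elements are drawn as random $b$-bit integers, and the function name `make_test_sets` is hypothetical.

```python
import random


def make_test_sets(size_b, d, b, seed=0):
    """Return (S_A, S_B) with |S_B| = size_b and symmetric difference d."""
    rng = random.Random(seed)
    a_only = d // 2          # elements held only by Host A
    b_only = d - a_only      # elements held only by Host B
    pool = set()
    while len(pool) < size_b + a_only:
        pool.add(rng.getrandbits(b))  # distinct random b-bit strings
    pool = list(pool)
    common = set(pool[: size_b - b_only])
    s_b = common | set(pool[size_b - b_only : size_b])
    s_a = common | set(pool[size_b : size_b + a_only])
    return s_a, s_b
```

Giving Host B the larger share of an odd difference keeps $|S_B| \ge |S_A|$, matching the convention $t = |S_B| \ge |S_A|$ used in Section IV.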

We used the CBF from [9] in a manner analogous to the usage of the IBF in the C2SS-BF method. In particular, a CBF from [9] was used to determine the set difference $S_A \setminus S_B$ on Host A and a CBF was used to determine the set difference $S_B \setminus S_A$ on Host B. Then $S_A \setminus S_B$ was sent from Host A to Host B and similarly $S_B \setminus S_A$ was sent from Host B to Host A. For the results shown in Figures 2 and 3, we constructed CBFs of the following lengths: 1) 65100 2) 303600 3) 492800 4) 2077200 5) 2698800 6) 3338400 7) 3993300 8) 4661100. The CBF of length 65100 was used for data reconciliation when $d = 10$; the CBF of length 303600 was used to reconcile data when $d = 20$, and so on. The CBFs constructed consisted of simply an array of cells containing binary numbers.

In Figure 2, we plotted the error rates for the C2SS-BF method and the approach from [9] for varying values of $d$. We assumed, for the purposes of the C2SS-BF method, that $d$ was known and that $k = 4$. Since the approach from [13] is exact, the probability of correct synchronization is 1 and so no data is present for the polynomial interpolation approach described in [13] in Figure 2. It can be seen from Figure 2 that as $d$ increases, the probability of incorrect synchronization for the C2SS-BF method decreases, which is consistent with the analysis from the previous section (since the probability of incorrect synchronization is $O(d^{-k+2})$). Such a trend did not seem to hold for the approach from [9] even though the size of the CBF was increased for larger values of $d$.

In Figure 3, we plotted the total number of bits that were sent between Hosts A and B using the C2SS-BF method, the approach from [9], and the approach from [13]. As a frame of reference, we also plotted a lower bound of $db$. The polynomial interpolation approach from [13] (like the approaches in [6], [10], [12]) requires at least $2db$ bits of information exchange since these approaches only require a single round of communication. The C2SS-BF method as well as the one from [9] use two rounds of communication and, as a result, these approaches reduced the total throughput given our test scenario.

Fig. 2. Error Rates for Synchronization Algorithms

From Figures 2 and 3 it can be seen that the C2SS-BF method has a lower probability of incorrect synchronization and it requires less throughput than the approach in [9]. This lower probability of incorrect synchronization is a result of using an IBF instead of a CBF. We note that in addition to requiring the transmission of fewer bits, the C2SS-BF decoder has complexity $O(d\max(\log|S_A|, \log|S_B|))$ whereas the CBF approach in [9] has decoding complexity $O(|S_A| + |S_B|)$. Recall that the method from [13] has decoding complexity $O(d^3)$, which in many cases renders it impractical.


Fig. 3. Total Throughput for Synchronization Algorithms (plot title: Throughput Comparison; curves: Counting Bloom Filter, C2SS-BF Method, Polynomial Interpolation, Theoretic Lower Bound)


VI.  Conclusion 

In  this  work,  we  considered  an  algorithm  for  synchronizing 
similar  sets  of  data.  In  particular,  we  considered  an  approach  to 
the  set  reconciliation  problem,  known  as  the  C2SS-BF  method, 
which  requires  only  two  rounds  of  communication.  It  was 
demonstrated  that  the  C2SS-BF  method  has  the  potential  to 
reduce  the  throughput  as  well  as  computational  complexity  of 
many  alternative  schemes  in  the  literature. 

We note that one potential limitation of the C2SS-BF method is that it requires an upper bound for $d = |S_A \triangle S_B|$. Thus, additional communication may be necessary to produce accurate estimates for $d$. However, if accurate upper bounds for $d$ can be determined, the C2SS-BF method can significantly reduce the rounds of communication required to determine $S_A \cup S_B$ on either Host A or Host B. Future work involves incorporating our algorithm into future releases of C2SS and providing mechanisms that estimate the symmetric difference.
Appendix A
Proof of Theorem 1

In this section, we consider the probability that the decoding algorithm described in Section IV fails. There are three possible scenarios under which the decoding algorithm would fail. First, Algorithm 2 can fail at step 14 if there is any element in $S_B$ that hashes to all the same cells as an element in $S_A \triangle S_B$. Second, the decoding can fail if a cell is hashed to $q$ or more times by elements in $S_A \triangle S_B$. Finally, the decoding can fail at step 28 if the following scenario holds: suppose $S'$ is a subset of $S_A \triangle S_B$ and suppose $C$ is the set of all cells hashed to by the elements of $S'$. Then, Algorithm 2 can fail if for every $i \in C$, at least two elements in $S'$ hash to $i$. The three conditions are stated mathematically below:

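1) $\exists x \in S_B$, $\exists y \in S_A \triangle S_B$ with $x \neq y$ where $f^k(x) = f^k(y)$.

2) $\exists i$, $1 \le i \le n$, where $|\{(x, j) : x \in S_A \triangle S_B,\ j \in [k],\ f_j(x) = i\}| \ge q$.

3) Suppose $S' \subseteq (S_A \triangle S_B)$ and $C = \{f_i(x) : i \in \{1, 2, \ldots, k\},\ x \in S'\}$, and $\forall c \in C$, $\exists x, \exists y \in S'$ with $x \neq y$, $\exists i, \exists j \in \{1, \ldots, k\}$ where $f_i(x) = f_j(y) = c$.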


In the following, we refer to the first event listed as item 1) above as $\xi_1$, the second event as $\xi_2$, and the third event as $\xi_3$. For any event $\xi$, we let $P(\xi)$ denote the probability that the event $\xi$ occurs. Let $\xi$ denote the event that $\mathcal{T} \neq S_A \triangle S_B$ where $\mathcal{T}$ is computed according to Algorithm 2. Then, by the union bound we have

$$P(\xi) \le P(\xi_1) + P(\xi_2) + P(\xi_3). \qquad (1)$$

We  begin  with  the  following  lemma. 

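Lemma 1. $P(\xi_1) \le d\left(1 - \left(1 - \left(\frac{k}{d(k+1)}\right)^k\right)^{|S_B|}\right)$.

Proof: For any $y \in S_A \triangle S_B$, $x \in S_B$, $P(f^k(x) = f^k(y)) = \left(\frac{k}{d(k+1)}\right)^k = \frac{k^k}{d^k(k+1)^k}$. Therefore, for $y \in S_A \triangle S_B$,

$$P(\nexists x \in S_B : f^k(x) = f^k(y)) = \left(1 - \left(\frac{k}{d(k+1)}\right)^k\right)^{|S_B|}.$$

Then, since $|S_A \triangle S_B| = d$, we have $P(\xi_1) \le d\left(1 - \left(1 - \left(\frac{k}{d(k+1)}\right)^k\right)^{|S_B|}\right)$ as desired. ■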

The  following  corollary  follows  from  Lemma  1. 

Corollary  2.  If  |5b|  <  then  P(6)  < 

We next bound the probability $P(\xi_2)$.

Lemma 2. If $q < k\log_e(d) + e - 1$ then $P(\xi_2) \le O(d^{-k+2})$.

Proof: For some integer $i$ where $1 \le i \le d(k+1)$ and any $x \in S_A \triangle S_B$, let $X$ be a random variable that is equal to 1 when $f_j(x) = i$ and 0 otherwise, for $j = \left\lceil \frac{ik}{d(k+1)} \right\rceil$. Then $X$ is a Bernoulli random variable with parameter $p = \frac{k}{d(k+1)}$.

Let $A_i$ be a random variable that has value $|\{x \in S_A \triangle S_B,\ j \in \{1, 2, \ldots, k\} : f_j(x) = i\}|$. Notice that since $X$ is a Bernoulli random variable, $A_i$ is a Poisson random variable with mean $\lambda = dp = \frac{k}{k+1}$. Applying the Chernoff bound (c.f. [5]), which states that $P(A_i \ge a) \le e^{-ta} M_{A_i}(t)$ where $M_{A_i}(t)$ denotes the moment generating function for $A_i$, gives $P(A_i \ge q) \le e^{-tq} e^{\lambda(e^t - 1)}$. Letting $t = 1$ and substituting $q = k\log_e(d) + e - 1$ gives that

$$P(A_i \ge q) \le e^{-q} e^{\lambda(e - 1)} \le e\,d^{-k}.$$

Since there are $d(k+1)$ possibilities for the index $i$, the probability that there exists an $i$ where $|\{x \in S_A \triangle S_B,\ j = \left\lceil \frac{ik}{d(k+1)} \right\rceil : f_j(x) = i\}| \ge q$ is at most $(k+1)d \cdot e\,d^{-k} = O(d^{-k+2})$, as desired. ■
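The per-cell bound can be sanity-checked numerically. The sketch below assumes the model used in the proof, namely that each of the $d$ differing elements lands in a fixed cell independently with probability $p = k/(d(k+1))$, and it evaluates the exact binomial tail rather than the Poisson approximation. The function names are hypothetical and $q$ is rounded up to an integer.

```python
import math


def binomial_tail(trials, p, threshold):
    """P(X >= threshold) for X ~ Binomial(trials, p), summed in log space."""
    total = 0.0
    for j in range(threshold, trials + 1):
        log_term = (math.lgamma(trials + 1) - math.lgamma(j + 1)
                    - math.lgamma(trials - j + 1)
                    + j * math.log(p) + (trials - j) * math.log1p(-p))
        total += math.exp(log_term)
    return total


def cell_overflow_check(d, k):
    """Return (exact tail, d**-k) for a single cell under the stated model."""
    p = k / (d * (k + 1))
    q = math.ceil(k * math.log(d) + math.e - 1)
    return binomial_tail(d, p, q), d ** (-k)
```

Comparing the two returned values for a few choices of $d$ and $k$ shows how much slack the Chernoff step leaves; the exact tail is typically many orders of magnitude below $d^{-k}$.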

Finally, we bound the probability of the third event in (1).

Lemma 3. $P(\xi_3) \le O(d^{-k+2})$.

Proof: The probability of such an event occurring is equivalent to the probability that a 2-core exists in a hypergraph (see [8]). This probability was shown to be at most $O(d^{-k+2})$. ■

Combining Equation (1) with Lemmas 1, 2, and 3, we have the result.

Theorem  1.  ^  |<Sb|  <  A:,  _  ;  and  q  <  k\og^{d)  + 

$e - 1$, then $P(\xi) \le O(d^{-k+2})$.

Appendix  B 

Analytic  Comparison  of  Proposed  Approach  with 
Existing  Approaches 

In this section, we provide an analytic comparison of the approach proposed in this paper with existing approaches to set reconciliation. We begin by considering the properties of our set reconciliation algorithm (as described in Section IV). Recall $S_A, S_B \subseteq GF(2)^b$ and the size of the symmetric difference is $d = |S_A \triangle S_B| = |(S_A \setminus S_B) \cup (S_B \setminus S_A)|$. Let $k$ be a positive integer and suppose $t = |S_B| \ge |S_A|$ and that $t$ satisfies the bound on $|S_B|$ required by Theorem 1, where $e$ is the base of the natural log. Our proposed approach requires approximately $2d(k+1)\left(\log_2(k\log_e(d) + e - 1) + k\log_2(d(k+1))\right) + db$ total bits of information exchange and the probability of incorrect synchronization (or the probability the algorithm fails) is $O(d^{-k+2})$. For the case where $t, d \ll 2^b$, our approach requires approximately $db(1 + \epsilon)$ total bits of information exchange where $0 < \epsilon < 1$, so that our approach is close to the optimum total number of bits that must be exchanged, which is $db$. Furthermore, our approach has decoding complexity $O(d\log(t))$ and encoding complexity $O(t)$.
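The throughput expression is easy to evaluate for concrete parameters. The following sketch plugs numbers into the formula above, taking $q = k\log_e(d) + e - 1$ as in the analysis; the function names are hypothetical and fractional bit counts are left unrounded.

```python
import math


def c2ss_bf_total_bits(d, k, b):
    """Two IBFs of d*(k+1) cells each, plus d elements of b bits."""
    n = d * (k + 1)
    q = k * math.log(d) + math.e - 1
    bits_per_cell = math.log2(q) + k * math.log2(n)
    return 2 * n * bits_per_cell + d * b


def single_round_bits(d, b):
    """The 2*d*b figure quoted for single-round approaches."""
    return 2 * d * b
```

For the simulation parameters $d = 10$, $k = 4$, $b = 131072$, the two filters contribute only a few thousand bits, so the total is about 1.31 million bits against roughly 2.62 million for $2db$; the filter overhead is negligible whenever $b$ is large relative to $k\log_2(d(k+1))$.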

Recall  that  our  algorithm  requires  an  additional  round  of 
communication.  As  a  result,  in  many  cases  our  algorithm 
reduces  the  throughput  required  in  both  [6]  and  [13].  Recall 
that,  if  the  methods  from  [6]  and  [13]  are  used,  then  the 
throughput  is  at  least  2Kdb  for  some  positive  integer  K. 

If the approach from [9] is used with a Counting Bloom Filter (CBF) of size $d(k+1)$ (which is the size of the Bloom Filter used in our approach), then from equation (8) in [9] we have that the probability of incorrect synchronization is

approximately  0(1  -  (1  - 

which is much larger than $O(d^{-k+2})$ for small $k$ and large $d$. In addition, the CBF method in [9] has decoding complexity $O(|S_A| + |S_B|)$ which, for the case where $|S_A| + |S_B|$ is large, may be prohibitively expensive.


